White Wine Quality by John Gritch

Univariate Plots Section

Quick Overview

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

I selected free sulfur dioxide as my main feature of interest. The lowest level of free sulfur dioxide in the sample is 2.00 mg / dm^3 and the highest is 289.00 mg / dm^3. The median is 34 mg / dm^3 and three quarters of the wines have a free sulfur dioxide level less than or equal to 46.00 mg / dm^3.


Main Univariate Plots

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

There are 932 observations with free sulfur dioxide levels equal to or greater than 50 mg / dm^3 and 3966 observations with levels below 50. Fifty mg / dm^3 is supposedly the level at which free sulfur dioxide becomes noticeable in the smell and taste of the wine.
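As a rough check on the counts above, the sketch below recomputes the total and the share of wines at or above the 50 mg / dm^3 threshold. The counts come from the text; the vector `free_so2_head` is an illustrative name holding only the first ten values shown in the str() output, so its result says nothing about the full data set.

```r
# Counts reported in the text above
n_at_or_above <- 932   # free SO2 >= 50 mg / dm^3
n_below       <- 3966  # free SO2 <  50 mg / dm^3

n_total <- n_at_or_above + n_below
share_at_or_above <- n_at_or_above / n_total

n_total                             # matches the 4898 observations in str()
round(100 * share_at_or_above, 1)   # about 19 percent of the wines

# The same kind of count on a vector of measurements.
# free_so2_head holds only the first ten values from the str() output.
free_so2_head <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
sum(free_so2_head >= 50)            # none of these ten reach the threshold
```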

## numeric(0)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

There are very few wines with the worst scores (3 and 4) and very few with the best scores (8 and 9). Most of the wines have a “medium” score of 5, 6, or 7.

These graphs show total SO2, sulphates and pH, the variables that, at this point, I expect to have the most influence on free sulfur dioxide levels.

Total SO2 and sulphates should contribute to free SO2 directly. In theory, all else being equal, the more acidic a solution is, the more the chemical equilibrium between free and bound sulfur dioxide is pushed toward the unbound, free form.

## stat_bindot: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The vast majority of wines have residual sugar levels less than 20 g / dm^3. This dot plot of wines with residual sugar greater than 20 g / dm^3 shows the exact location and value of the highest data points.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Sugar has the opposite effect from acidity: the more sugar molecules there are in a solution, the more the sulfur dioxide (and related ions) binds with them, pushing the chemical equilibrium toward the bound form.

The fermentation process converts sugar to alcohol, so we should see a general inverse relationship between the amount of sugar and alcohol, but wines will start off with and retain different amounts of sugar depending on the grapes and what flavors the vintner is trying to achieve.

Univariate Analysis

What is the structure of your dataset?

The dataset is long format (or tidy format) data with 4898 observations of 13 variables. Eleven of the variables are quantitative measurements of a chemical or physical property, one variable is a subjective labeling of taste quality and the final variable is an explicit observation id.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is free sulfur dioxide. Specifically, I’d like to know what is affecting both the absolute levels and the proportion of unbound or free sulfur dioxide in a wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The main feature of interest is free sulfur dioxide, but that compound exists in a complicated equilibrium with total sulfur dioxide, sulphates, pH levels, and other molecules present in the wine like residual sugars, acetaldehyde, and phenols. I am also interested in the relationship between quality and free/total sulfur dioxide.

Did you create any new variables from existing variables in the dataset?

Yes, I created five: quality.ordfactor (quality transformed into an ordered factor), SO2.portion.free (free SO2 divided by total SO2), pH.bucket (pH cut into 8 unequal buckets), residual.sugar.bucket (residual.sugar cut in various ways) and quality.bucket (quality cut into 3 buckets to approximate low, medium, and high quality).
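A minimal sketch of how three of these derived variables could be built follows. The data frame `ww_head` is an illustrative stand-in that holds only the first ten rows shown in the str() output, and the break points used for quality.bucket are an assumption chosen to match the low (3-4), medium (5-7), high (8-9) grouping described earlier, not necessarily the cut actually used.

```r
# First ten rows of three columns, copied from the str() output above
ww_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  quality              = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L)
)

# quality as an ordered factor over the observed score range 3..9
ww_head$quality.ordfactor <- factor(ww_head$quality, levels = 3:9, ordered = TRUE)

# proportion of total SO2 that is free (unbound)
ww_head$SO2.portion.free <- with(ww_head, free.sulfur.dioxide / total.sulfur.dioxide)

# three quality buckets; these break points are an assumption:
# (2,4] = low, (4,7] = medium, (7,9] = high
ww_head$quality.bucket <- cut(ww_head$quality,
                              breaks = c(2, 4, 7, 9),
                              labels = c("low", "medium", "high"))

ww_head[, c("quality.ordfactor", "SO2.portion.free", "quality.bucket")]
```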

I also created two new data frames (ww.high.sugar and ww.low.sugar) by splitting the data into wines with residual sugar above and below the median value of 5.2 g / dm^3.
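The split could be sketched as below. `ww_head` again holds only the first ten residual.sugar values from the str() output, and which side receives wines exactly at the median is an assumption (here they go to the low group).

```r
# First ten residual.sugar values from the str() output above
ww_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

sugar_median <- 5.2  # median of residual.sugar reported in the summary

# Assumption: wines exactly at the median are placed in the low group
ww.low.sugar  <- subset(ww_head, residual.sugar <= sugar_median)
ww.high.sugar <- subset(ww_head, residual.sugar >  sugar_median)

nrow(ww.low.sugar)   # rows at or below the median in this ten-row sample
nrow(ww.high.sugar)  # rows above the median in this ten-row sample
```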

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most (not all) of the variables were roughly normally distributed, with a fairly consistent tendency to be skewed to the right to some degree. Residual sugar was very heavily skewed to the right. At this point I did not transform any of the data.

Bivariate Plots Section

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## quality               0.035763247   -0.11366283     -0.194722969
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## quality              -0.009209091   -0.097576829 -0.20993441
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## quality                     0.0081580671         -0.174737218 -0.30712331
##                                 pH    sulphates     alcohol      quality
## X                    -0.1157741316  0.009807759  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.44889210 -0.174737218
## density              -0.0935914935  0.074493149 -0.78013762 -0.307123313
## pH                    1.0000000000  0.155951497  0.12143210  0.099427246
## sulphates             0.1559514973  1.000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.017432772  1.00000000  0.435574715
## quality               0.0994272457  0.053677877  0.43557472  1.000000000

Looking at the free.sulfur.dioxide results from the correlation table I noted the linear correlations with residual.sugar (.299), total.sulfur.dioxide (.616), density (.294) and alcohol (-.250) [values rounded].
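One way to read a single column of a correlation table is to rank it by absolute value. The sketch below does that for the free.sulfur.dioxide column, with the coefficients copied from the table above and rounded to four decimals; the observation id X is left out and the vector name `r_free_so2` is illustrative.

```r
# Correlations with free.sulfur.dioxide, rounded from the table above
r_free_so2 <- c(
  fixed.acidity        = -0.0494,
  volatile.acidity     = -0.0970,
  citric.acid          =  0.0941,
  residual.sugar       =  0.2991,
  chlorides            =  0.1014,
  total.sulfur.dioxide =  0.6155,
  density              =  0.2942,
  pH                   = -0.0006,
  sulphates            =  0.0592,
  alcohol              = -0.2501,
  quality              =  0.0082
)

# Rank by strength of linear association, ignoring sign
ranked <- r_free_so2[order(abs(r_free_so2), decreasing = TRUE)]

head(ranked, 4)  # the four variables noted in the text
tail(ranked, 1)  # the weakest linear relationship
```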

I am surprised there is not a linear relationship between sulphates and free sulfur dioxide, as the data set text file said that sulphates can contribute to free SO2 levels. At this point I am thinking that maybe the relationship is non-linear, or that a pattern might emerge if other (yet unknown) variables are accounted for. But then again, “could” sometimes means “won’t”, so we will see.

I also initially expected to see a stronger relationship between free.sulfur.dioxide and pH (-.001), but I think this was probably short-sighted considering the logarithmic nature of pH.

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

## 
##  Pearson's product-moment correlation
## 
## data:  ww$free.sulfur.dioxide and (10^(ww$pH))
## t = -0.4965, df = 4896, p-value = 0.6196
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03509507  0.02091502
## sample estimates:
##          cor 
## -0.007095588

A sort of dip or wave appears in the conditional means of the data, both against pH and after transforming pH onto a linear scale of hydrogen ions. It is not clear to me what this dip might mean; possibly it’s a function of the different types of white wines and the flavors the vintners are trying to produce, or maybe it’s noise. Maybe the cause will become clearer after looking at more variables. Next I will look at the variables that displayed at least some sort of linear relationship with free sulfur dioxide.
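For reference, pH is the negative base-10 logarithm of the hydrogen ion concentration, so the linear-scale quantity is 10^(-pH). The correlation output above is labelled with 10^(ww$pH), which is the reciprocal of that concentration, so it may be worth rechecking which transform was intended. The sketch below uses only the first ten pH values from the str() output, and the variable names are illustrative.

```r
# First ten pH values from the str() output above
pH_head <- c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)

# Hydrogen ion concentration in mol / L: [H+] = 10^(-pH)
hydrogen_ion <- 10^(-pH_head)

# The quantity that appears in the output label above
ten_to_pH <- 10^(pH_head)

# The two are reciprocals, so they rank the wines in opposite orders
all.equal(hydrogen_ion, 1 / ten_to_pH)

# Lower pH (more acidic) means a higher hydrogen ion concentration
hydrogen_ion[1] > hydrogen_ion[2]   # pH 3 versus pH 3.3
```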

Also: The horizontal red line at 50 mg / dm^3 free.sulfur.dioxide is meant to flag where (according to the data set text file) the taste and smell of free sulfur dioxide becomes apparent.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The first graph of residual sugar and free sulfur dioxide shows what looks like a weak positive linear correlation, but it also shows a thick band of overplotting where about half of the wines have a residual sugar level of less than 5 g / dm^3 or so.

In the second graph I zoomed in on the lower values of residual sugar to alleviate the overplotting.

In the third graph I plotted only the top half of the residual.sugar values. It appeared that the correlation with free sulfur dioxide was stronger for the higher-than-median sugar values. I created two data frames separating the sugar values above and below the median (5.2 g / dm^3) and calculated Pearson’s r for each. The results are below.

## 
##  Pearson's product-moment correlation
## 
## data:  ww.low.sugar$residual.sugar and ww.low.sugar$free.sulfur.dioxide
## t = 3.6416, df = 2467, p-value = 0.0002765
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03377042 0.11224545
## sample estimates:
##        cor 
## 0.07312111
## 
##  Pearson's product-moment correlation
## 
## data:  ww.high.sugar$residual.sugar and ww.high.sugar$free.sulfur.dioxide
## t = 6.1836, df = 2427, p-value = 7.329e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08519119 0.16350265
## sample estimates:
##       cor 
## 0.1245409
## 
##  Pearson's product-moment correlation
## 
## data:  ww$residual.sugar and ww$free.sulfur.dioxide
## t = 21.9324, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2733819 0.3243875
## sample estimates:
##       cor 
## 0.2990984

Thinking that the very high sugar values in the ww.high.sugar results might be lowering the correlation coefficient for the bulk of the data, I also calculated Pearson’s r for residual sugar values above 5.2 and below 25 g / dm^3 and got an r of 0.14975. [table not shown]
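A quick consistency check on the three test outputs above: Pearson's test in cor.test reports df = n - 2, so the group sizes can be recovered from the degrees of freedom and compared with the full data set. The df values are copied from the outputs; the variable names are illustrative.

```r
# Degrees of freedom reported by the three cor.test outputs above
df_low  <- 2467   # ww.low.sugar
df_high <- 2427   # ww.high.sugar
df_all  <- 4896   # full data set

# For Pearson's product-moment test, df = n - 2
n_low  <- df_low  + 2
n_high <- df_high + 2
n_all  <- df_all  + 2

c(low = n_low, high = n_high, all = n_all)

# The two halves should account for every observation exactly once
n_low + n_high == n_all
```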

The general trend up is expected from the correlation coefficient, but the stair step pattern is interesting and I will have to come back to that to investigate further in multivariate analysis.

These plots look pretty much as expected considering the correlation table’s results.

There seems to be an interesting shift up in the data around .994-.995 g / cm^3 density. To describe it, I would say it almost looks like a transform fault between two tectonic plates.

If I had to guess, I would say the relationship in general is being driven by the residual sugars / alcohol complex, with lower residual sugars lowering the density and also the amount of bound sulfur dioxide related ions. At this point, though, I don’t know why the shift appears the way it does instead of as a more gentle and linear slope.

There’s a slight trend for the free sulfur dioxide levels to go down as alcohol levels rise. My first thought was this was at least partially caused by the tendency for free SO2 levels to decrease when surrounded by higher levels of sugar molecules. And in turn the amount of sugar that remains in solution is inversely related to the amount that is converted to alcohol by the fermentation process.

## [1] -0.4506312

These graphs show the general inverse relationship between alcohol and residual sugar. What was interesting to me was the initial rise in alcohol content in the very lowest sugar wines.

Knowing now that there would be a thick band of overplotting at residual sugars around 5, I decided to flip the graph to get a better look at the scatterplot between alcohol and residual sugars.

## 
##  Pearson's product-moment correlation
## 
## data:  ww$sulphates and ww$free.sulfur.dioxide
## t = 4.1508, df = 4896, p-value = 3.369e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03126264 0.08707928
## sample estimates:
##        cor 
## 0.05921725

Taking a closer look at sulphates and free sulfur dioxide than the scatterplot matrix could provide, it’s still not apparent to me that there is any sort of real relationship between these two variables. We’ll see if anything shows up in multivariate analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02362 0.19090 0.25370 0.25560 0.31580 0.71050

The variable SO2.portion.free was calculated as free sulfur dioxide / total sulfur dioxide.

## 
##  Pearson's product-moment correlation
## 
## data:  ww$free.sulfur.dioxide and ww$SO2.portion.free
## t = 76.6688, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7256365 0.7511009
## sample estimates:
##       cor 
## 0.7386321

These graphs and the linear correlation output show the rise in the proportion of unbound SO2 as absolute levels of free SO2 rise. (I tentatively assume the total SO2 is also rising in tandem.) Calculating the coefficient of determination as r^2 = .546, it looks like about half of the variance in SO2.portion.free can be explained by free SO2 levels.
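The arithmetic behind the coefficient of determination can be sketched as follows, using the r and df printed in the output above. As a cross-check, the same two numbers also reproduce the reported t statistic through the standard relation t = r * sqrt(df) / sqrt(1 - r^2). The variable names are illustrative.

```r
# Values copied from the cor.test output above
r  <- 0.7386321
df <- 4896

# Coefficient of determination
r_squared <- r^2
round(r_squared, 3)

# t statistic implied by r and df; should agree with the reported 76.6688
t_stat <- r * sqrt(df) / sqrt(1 - r_squared)
round(t_stat, 2)
```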

Unfortunately, there is overplotting at the size the graph is rendered in the knitted HTML file, but when enlarged there are distinct curvilinear trends inside the scatterplot. The curves can be seen most clearly in the lower left of the graph. I’m not sure if this is the influence of another variable or if it’s an artifact introduced by graphing a proportion against one of its constituent parts.

## 
##  Pearson's product-moment correlation
## 
## data:  ww$residual.sugar and ww$SO2.portion.free
## t = 3.6034, df = 4896, p-value = 0.0003172
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02345712 0.07932200
## sample estimates:
##        cor 
## 0.05142979

While there was a small positive relationship between residual sugar and free sulfur dioxide (the opposite of what I expected, as sulfur dioxide related ions will bind with sugar molecules), the correlation between residual sugar and the portion of unbound SO2 is so small (r = .05) that there may or may not be a real relationship.

Considering these facts together what I think this means is that the higher sugar wines, having more sugar (and probably less alcohol as well), need higher levels of sulfur dioxide in general to protect against oxidation, microbial growth, etc. I think it is these higher levels of SO2 that are driving the positive residual.sugar to free sulfur dioxide relationship.

From the preceding three graphs I gleaned nothing.

What most strikes me from these boxplots is that in no quality bracket does the 3rd quartile extend past the threshold of 50 mg / dm^3 free sulfur dioxide.

It looks like the higher the quality of the wine the smaller the variability in total SO2 levels.

And the lowest quality wines tend to have less of their SO2 unbound.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

In any given wine, increased residual sugars should decrease the amount of free SO2 in relation to total SO2 as the sulfur dioxide and related ions bind to the sugars. What was interesting was that the absolute levels of free SO2 rose in a stair-step pattern with increased levels of residual sugars. The positive trend makes some intuitive sense under the hypothesis that residual sugars are prone to oxidize or otherwise spoil, so more free SO2 would be needed to act as a preservative. So the general rise is somewhat intuitive, but the sharp rise followed by a plateau that can be seen in the conditional means is more perplexing.

I have no idea what is causing this pattern. With the exception of the wines with sugars below 2 g / dm^3 the sharp rises tend to occur at the least populated levels of residual sugar. It’s a bit of a subjective call, but there seem to be distinctive bands of small intervals where many wines share a close level of residual sugar. To my eye there are bands around 4-5, 7-8, and 11-15 and the free sulfur dioxide levels also plateau near these levels.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

An interesting relationship was the one observed between residual sugars and alcohol. In general the higher the alcohol the lower the sugar, which makes sense, but the lowest sugar wines (those under 3 g / dm^3) display the opposite effect. Why this is the case, I don’t know. Maybe a little sugar is needed to mellow out the “hotness” of the alcohol? Maybe it’s just this data, I don’t know, but I would love to find out.

What was the strongest relationship you found?

Strictly speaking residual sugar and density had the strongest identifiable relationship. The inverse relationship between alcohol and density was the second strongest.
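A small sketch of how the strongest pairs can be pulled out of a correlation matrix follows. To keep it short it uses only three variables, with coefficients rounded from the correlation table in the Bivariate Plots Section; the object names are illustrative.

```r
vars <- c("residual.sugar", "density", "alcohol")

# Pairwise correlations, rounded from the correlation table
r <- matrix(c( 1.0000,  0.8390, -0.4506,
               0.8390,  1.0000, -0.7801,
              -0.4506, -0.7801,  1.0000),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

# Visit each pair once by taking the upper triangle only
idx <- which(upper.tri(r), arr.ind = TRUE)
pair_r <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = r[upper.tri(r)],
  stringsAsFactors = FALSE
)

# Strongest relationships first, ignoring sign
ranked <- pair_r[order(abs(pair_r$r), decreasing = TRUE), ]
ranked
```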

Multivariate Plots Section

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

This is residual sugar vs. absolute levels of free sulfur dioxide, facet wrapped by pH. There’s a little funkiness, but the general tendency for free SO2 levels to rise as residual sugar does can still be seen at most of the binned pH levels.

Note that because there were few data points in the most acidic and basic pH bins I collapsed 2.8 and 2.9 into one bin and 3.5, 3.6, 3.7 and 3.8 into one bin.
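Collapsing sparse bins can be done by passing unequal break points to cut(). The break points below are hypothetical: they merge the low and high ends into single bins and yield eight buckets, but they are not necessarily the ones used for pH.bucket. The input is the first ten pH values from the str() output.

```r
# First ten pH values from the str() output above
pH_head <- c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)

# Hypothetical unequal breaks: one wide bin at each end, 0.1-wide bins between.
# cut() builds right-closed intervals by default, e.g. (2.9, 3.0].
ph_breaks <- c(2.7, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.9)

pH.bucket <- cut(pH_head, breaks = ph_breaks)

nlevels(pH.bucket)  # eight buckets from nine break points
table(pH.bucket)    # counts per bucket, including empty ones
```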

This is residual sugar vs SO2.portion.free facet wrapped by pH.

This is SO2 portion free vs pH facet wrapped by residual sugars in the range (4,12], with a step of 1.

The dataset text file said that lower pH levels would drive the dynamic equilibrium between free SO2 and bound towards free gas molecules. Taking the dataset as a whole I could not observe this relationship. But I thought it possible that the forcing effect of residual sugar was much stronger than pH and if residual sugars were controlled for there might be an observable relationship between pH and the portion of free SO2.

First I cut the data into quartiles, but there was no observable trend and that method of division seemed very forced or artificial for what I wanted to look at.

Second, using what I could see from the bivariate plot of residual sugar against free sulfur dioxide, I tried a region where free SO2 rose steeply with residual sugar (5 - 7 g / dm^3 sugar with .25 length intervals), then a region where free SO2 stayed the same as residual sugar increased (8 - 10 g / dm^3 sugar with .5 length intervals). Neither of those graphs showed a consistent inverse relationship between pH and SO2 portion free.
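The interval schemes described in this section can be generated with seq() and cut(). The sketch below builds the three sets of breaks named in the text and applies them to the first ten residual.sugar values from the str() output; values outside a range come back as NA, which is one way to restrict a facet plot to that range. The object names are illustrative.

```r
# First ten residual.sugar values from the str() output above
sugar_head <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

wide_breaks  <- seq(4, 12, by = 1)     # (4,12] with a step of 1
steep_breaks <- seq(5, 7,  by = 0.25)  # 5 - 7 with .25 length intervals
flat_breaks  <- seq(8, 10, by = 0.5)   # 8 - 10 with .5 length intervals

wide_bin  <- cut(sugar_head, breaks = wide_breaks)
steep_bin <- cut(sugar_head, breaks = steep_breaks)
flat_bin  <- cut(sugar_head, breaks = flat_breaks)

# Number of intervals in each scheme
c(wide = nlevels(wide_bin), steep = nlevels(steep_bin), flat = nlevels(flat_bin))

# How many of the ten sample values fall inside each range
c(wide = sum(!is.na(wide_bin)),
  steep = sum(!is.na(steep_bin)),
  flat = sum(!is.na(flat_bin)))
```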

I also tried transforming pH into hydrogen ions, but still no trend was apparent.

These last two graphs are similar to the ones that precede them, but the x axis has been reversed so acidity increases from left to right. It’s hard to say that the proportion of free SO2 increases as acidity increases, but that can certainly be seen at some of the residual sugar levels.

This graph turned out to be very busy and hard to interpret, but I did get the sense that the lower quality wines tended to be in the lower right area of the scatterplot. From this I wanted to look into whether the lower quality wines have lower ratios of free SO2 to total SO2.

This was my first attempt. I could see that the lower quality wines had on average less free SO2 per total SO2 levels, but the graph was still a little busy with lots of overplotting and grayish reds and blues on a gray background.

In this graph it can be seen that the trend for lower quality wines to have a lower free to total SO2 ratio does seem to exist. The problem with the first graph, which is somewhat ameliorated in this one, was that it was hard to see what was happening with the “middle” wines that are not at the extremes of the scale.

For any given level of total SO2 the high quality wines tend to have higher levels of free SO2. This means less SO2 is bound to “stuff”. The binding might be driven by pH, but given the results we have seen so far I would guess this high quality - high free SO2 relationship is being driven by residual.sugar/alcohol. There is a strong identifiable trend for higher alcohol wines to be rated higher and high alcohol wines tend to have lower sugar, and lower sugar means higher free SO2.
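
The ratio behind this comparison is simply free SO2 divided by total SO2. As a rough sketch (not the original plotting code), the first ten wines from the overview can be used to compute SO2.portion.free and to compare wines with similar total SO2. The 129 to 136 mg/dm^3 window and the name similar are arbitrary choices for the example.

```r
# First ten wines from the str() output (mg/dm^3)
wines <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Share of the total SO2 that is free (unbound)
wines$SO2.portion.free <- wines$free.sulfur.dioxide / wines$total.sulfur.dioxide

# Hold total SO2 roughly constant and look at the spread in the free share
similar <- subset(wines, total.sulfur.dioxide >= 129 & total.sulfur.dioxide <= 136)
similar[order(similar$SO2.portion.free), ]
```

All ten of these wines have a quality of 6, so the sample cannot show the quality effect itself. It only shows how much the free share can vary at a similar total.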

This was just a shot in the dark to see if pH had any relationship with the curvilinear lines in SO2.portion.free vs total SO2. The reason for these lines still eludes me.

These graphs show how residual sugar and alcohol interact with each other, and how alcohol plots against SO2.portion.free and binned quality.

Another look at the relationship between alcohol and density. The higher the alcohol concentration the lower the density.

This graph mainly shows two things. One, the lower alcohol wines tend to be rated more poorly. Two, this dataset has relatively more low alcohol wines than high alcohol wines.

With these binned and colored histograms I was looking to see if there were any obvious effects of pH on the preceding histogram. I did not see anything that particularly caught my eye.

These last two graphs were looking for any large scale effects pH might have on SO2.portion.free or residual.sugar, but again I did not glean any new information.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

A particularly interesting graph to me is total sulfur dioxide vs. free sulfur dioxide colored by binned quality. The big, clear trend is that the higher quality wines have higher proportions of free SO2. Interestingly, though, the wines with the lowest levels of both free and total sulfur dioxide are almost uniformly low rated wines. The wines with the highest free and total sulfur dioxide are also low rated.

In that graph it can be seen that above 50 mg/dm^3 of free SO2, which is supposedly the threshold at which you can detect the gas in the smell and taste of the wine, there seems to be no bias for the wines to be either high, medium, or low quality. That is an interesting result for a gas reported to have a “pungent, rotting” smell.

Were there any interesting or surprising interactions between features?

The graph of residual sugar against alcohol colored by quality was interesting in the sense that the trend is so clear. I was also surprised that I could not find a clear, consistent relationship between increased acidity and the proportion of free SO2, even after controlling for residual sugar and alcohol.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

I selected free sulfur dioxide as my main feature of interest, and to me that left two sets of questions: 1) what affects absolute levels of free SO2, and 2) what affects the proportion that is free and unbound.

The thing that has the most influence on absolute levels of free sulfur dioxide is total sulfur dioxide. And from what I can tell the thing that has the most effect on total sulfur dioxide is residual sugars.

This graph shows the relationship between total sulfur dioxide and free sulfur dioxide along with a line representing the smoothed conditional means.

The colors (which are somewhat arbitrary and not equally binned) do show that total sulfur dioxide tends to rise as sugar levels do.

Plot Two

Description Two

The first graph showed that there was some relationship between residual sugar and free sulfur dioxide, but in an imprecise way. In a sense it acted like a warning sign calling for more study. This graph shows the relationship in a more meaningful way. Here we can see that free sulfur dioxide levels do indeed rise as residual sugars increase. Interestingly, free sulfur dioxide does not rise in a consistent linear way, but in a stair step pattern.

Plot Three

Description Three

I chose this graph because I think it shows something very surprising. The data set text file said that increased acidity would drive the dynamic equilibrium of free to total SO2 towards free SO2. However, this graph shows no such relationship, and despite my best efforts to control for things like residual sugar, a clear trend still eluded me.


Reflection

This was a dataset of 4898 white wines. The main thing I was interested in was free sulfur dioxide. I started my analysis by looking at the raw, untransformed histograms (or a bar graph for the quality data). For the most part the graphs looked roughly normal, and I quickly moved on to bivariate analysis. After creating a correlation table I made note of those variables that showed some linear correlation with free sulfur dioxide: residual.sugar (.299), total.sulfur.dioxide (.616), density (.294) and alcohol (-.250) [values rounded; I arbitrarily chose a threshold of .2 for my analysis]. From my background reading I also carried pH forward for more analysis, as it was supposed to have an effect on the chemical equilibrium that controls the balance between free and total SO2.
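
The screening step can be sketched with cor() and a simple cutoff. The example below is an illustration on the first ten wines from the overview, so its coefficients will not match the full-data values quoted above. The names cors and strong are made up, and quality and density are left out because the overview shows a constant quality and only a truncated density column for these rows.

```r
# First ten wines from the str() output
wines <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  pH                   = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlation of every variable with free sulfur dioxide
cors <- cor(wines)[, "free.sulfur.dioxide"]
cors <- cors[names(cors) != "free.sulfur.dioxide"]

# Keep variables whose absolute correlation meets the (arbitrary) 0.2 threshold
strong <- cors[abs(cors) >= 0.2]
round(strong, 3)
```

Taking the absolute value matters here, since a negative correlation such as the one with alcohol should pass the same threshold as a positive one.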

After some stumbling about I began to see the relationship between residual sugars, alcohol, density, total sulfur dioxide and free sulfur dioxide. Residual sugar and alcohol are inversely related, as sugar is converted to alcohol during fermentation. This became very apparent after plotting the conditional means of residual sugars vs alcohol for those wines with residual sugars less than 25 g/dm^3. Of course different wines may start with different amounts of sugar, so you can’t perfectly predict one variable from the other.
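
A conditional mean here is just the average residual sugar at each alcohol level. One way to sketch it in base R, using the first ten wines from the overview rather than the full data, is with aggregate(). The name sugar.by.alcohol is made up for the example.

```r
# First ten wines from the str() output
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Drop wines at or above 25 g/dm^3, then average sugar within each alcohol value
sugar.by.alcohol <- aggregate(
  residual.sugar ~ alcohol,
  data = subset(wines, residual.sugar < 25),
  FUN = mean
)
sugar.by.alcohol
```

With only ten wines the means are noisy, so the sketch shows the mechanics of the calculation rather than the trend itself.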

Furthermore, residual sugar and alcohol act in concert with each other (along with other things) to determine density. The linear relationship between residual sugar and density may be the clearest in the data set. There is an apparent relationship between density and free sulfur dioxide, but I believe this to be a spurious relation driven by the connection of free sulfur dioxide to sugars and alcohol.

The residual sugar / alcohol axis controls to a large degree how much total sulfur dioxide is in the wine (presumably to reduce oxidation and spoilage). In general, the more sugar, the more total sulfur dioxide. Finally, the more total sulfur dioxide, the more free sulfur dioxide (correlation coefficient of 0.616).

I can’t say for certain what affects the proportion of free sulfur dioxide at any given level of total sulfur dioxide, but it is clear that the higher the proportion of free sulfur dioxide, the higher the probability that the wine was rated higher than a wine with a lower proportion of free sulfur dioxide. This can be seen fairly well in plot one of the final plots section.

One of the lingering questions I have about this data set is how to isolate the effect pH has on the proportion of free sulfur dioxide. If I were to analyze further, I would systematically control for each variable to determine whether it was somehow masking the effect pH should be having on the proportion of free SO2. There is also a very obvious shift in the scatterplot of free sulfur dioxide and density that perplexes me. For this shift I would like more data on the wines themselves, as I think it may come from how the vintners are making the wines into distinct flavor profiles.